Bridging the Language Gap: Synthetic Voice Diversity via Latent Mixup for Equitable Speech Recognition

Bian, Wesley, Lin, Xiaofeng, Cheng, Guang

arXiv.org Artificial Intelligence

Modern machine learning models for audio tasks often exhibit superior performance on English and other well-resourced languages, primarily due to the abundance of available training data. This disparity leads to an unfair performance gap for low-resource languages, where data collection is both challenging and costly. In this work, we introduce a novel data augmentation technique for speech corpora designed to mitigate this gap. Through comprehensive experiments, we demonstrate that our method significantly improves the performance of automatic speech recognition systems on low-resource languages. Furthermore, we show that our approach outperforms existing augmentation strategies, offering a practical solution for enhancing speech technology in underrepresented linguistic communities.
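
The abstract does not detail the augmentation itself; below is a minimal Python sketch of mixup applied in a latent (encoder) space, assuming a standard Beta-distributed interpolation of paired features and soft targets. The encoder, decoder, and target handling are placeholders, not the authors' implementation.

import torch

def latent_mixup(z_a, z_b, y_a, y_b, alpha=0.4):
    """Mix two latent representations and their targets.

    Generic mixup sketch: which layer is mixed and how ASR targets are
    combined is not specified in the abstract, so this assumes soft targets.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    z_mixed = lam * z_a + (1.0 - lam) * z_b
    y_mixed = lam * y_a + (1.0 - lam) * y_b  # assumes soft / one-hot targets
    return z_mixed, y_mixed

# Hypothetical usage inside a training step:
# z_a, z_b = encoder(x_a), encoder(x_b)   # latent features of two utterances
# z, y = latent_mixup(z_a, z_b, y_a, y_b)
# loss = criterion(decoder(z), y)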



From Sound to Setting: AI-Based Equalizer Parameter Prediction for Piano Tone Replication

Yu, Song-Ze

arXiv.org Artificial Intelligence

This project presents an AI-based system for tone replication in music production, focusing on predicting EQ parameter settings directly from audio features. Unlike traditional audio-to-audio methods, our approach generates interpretable parameter values, such as EQ band gains, that musicians can further adjust in their workflow. Using a dataset of piano recordings with systematically varied EQ settings, we evaluate both regression and neural network models. Results show that our neural network model achieves highly accurate parameter predictions, with a mean squared error of 0.0216 on multi-band tasks. The proposed system enables practical, flexible, and automated tone matching for music producers, laying the foundation for future extensions to more complex audio effects.
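
As a rough illustration of the parameter-prediction setup, here is a minimal PyTorch sketch of a small regression network trained with mean squared error to map audio features to per-band EQ gains. The feature dimension, number of bands, and architecture are assumptions for illustration, not the paper's model.

import torch
import torch.nn as nn

# Hypothetical dimensions: 40 summary features per recording, 5 EQ band gains.
N_FEATURES, N_BANDS = 40, 5

class EQGainPredictor(nn.Module):
    """Small MLP mapping audio features to per-band EQ gain values."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_FEATURES, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, N_BANDS),      # regression head: one gain per band
        )

    def forward(self, x):
        return self.net(x)

model = EQGainPredictor()
criterion = nn.MSELoss()                 # the paper reports MSE on multi-band tasks
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step on a synthetic batch of (features, target gains):
features = torch.randn(32, N_FEATURES)
target_gains = torch.randn(32, N_BANDS)
optimizer.zero_grad()
loss = criterion(model(features), target_gains)
loss.backward()
optimizer.step()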




Addressing Pitfalls in Auditing Practices of Automatic Speech Recognition Technologies: A Case Study of People with Aphasia

Mei, Katelyn Xiaoying, Choi, Anna Seo Gyeong, Schellmann, Hilke, Sloane, Mona, Koenecke, Allison

arXiv.org Artificial Intelligence

Automatic Speech Recognition (ASR) has transformed daily tasks from video transcription to workplace hiring. ASR systems' growing use warrants robust and standardized auditing approaches to ensure automated transcriptions of high and equitable quality. This is especially critical for people with speech and language disorders (such as aphasia) who may disproportionately depend on ASR systems to navigate everyday life. In this work, we identify three pitfalls in existing standard ASR auditing procedures, and demonstrate how addressing them impacts audit results via a case study of six popular ASR systems' performance for aphasia speakers. First, audits often adhere to a single method of text standardization during data pre-processing, which (a) masks variability in ASR performance from applying different standardization methods, and (b) may not be consistent with how users - especially those from marginalized speech communities - would want their transcriptions to be standardized. Second, audits often display high-level demographic findings without further considering performance disparities among (a) more nuanced demographic subgroups, and (b) relevant covariates capturing acoustic information from the input audio. Third, audits often rely on a single gold-standard metric -- the Word Error Rate -- which does not fully capture the extent of errors arising from generative AI models, such as transcription hallucinations. We propose a more holistic auditing framework that accounts for these three pitfalls, and exemplify its results in our case study, finding consistently worse ASR performance for aphasia speakers relative to a control group. We call on practitioners to implement these robust ASR auditing practices that remain flexible to the rapidly changing ASR landscape.
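
For reference, the Word Error Rate discussed above is the word-level edit distance between a reference and a hypothesis transcript, divided by the reference length. The short Python sketch below computes it from scratch and shows how a single text-standardization choice (stripping a disfluency) changes the score; it is only an illustration of the first pitfall, not the authors' auditing framework.

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Two plausible standardizations of the same pair give different WERs:
ref = "I um went to the store"
hyp = "I went to the store"
print(wer(ref, hyp))                     # disfluency kept in reference: 1/6
print(wer(ref.replace(" um", ""), hyp))  # disfluency stripped: 0/5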


Hallucination Level of Artificial Intelligence Whisperer: Case Speech Recognizing Pantterinousut Rap Song

Horppu, Ismo, Ayala, Frederick, Gulbenkoglu, Erlin

arXiv.org Artificial Intelligence

All languages are peculiar. Some of them are considered more challenging to understand than others. The Finnish language is known to be complex. Also, when languages are used by artists, the pronunciation and meaning can be trickier to understand. Therefore, we are putting AI to a fun, yet challenging trial: transcribing a Finnish rap song to text. We will compare the Faster Whisper algorithm and YouTube's internal speech-to-text functionality. The reference truth will be the Finnish rap lyrics, which the main author's little brother, Mc Timo, has written. Transcribing the lyrics will be challenging because the artist raps over synth music played by Syntikka Janne. The hallucination level and mishearing of the AI speech-to-text extractions will be measured by comparing the errors made against the original Finnish lyrics. The error function is informal but still works for our case.
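
The comparison presumably relies on the faster-whisper library; a minimal sketch of obtaining a Finnish transcript with it is shown below. The model size, device settings, and audio file name are placeholders, not details from the paper.

from faster_whisper import WhisperModel

# Model size, device, and the audio file name are placeholders.
model = WhisperModel("large-v3", device="cpu", compute_type="int8")

segments, info = model.transcribe("pantterinousut.mp3", language="fi")
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")

hypothesis = " ".join(segment.text.strip() for segment in segments)
print(hypothesis)  # compare against the reference lyrics to count errors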


SSLAM: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes

Alex, Tony, Ahmed, Sara, Mustafa, Armin, Awais, Muhammad, Jackson, Philip JB

arXiv.org Artificial Intelligence

Self-supervised pre-trained audio networks have seen widespread adoption in real-world systems, particularly in multi-modal large language models. These networks are often employed in a frozen state, under the assumption that the self-supervised pre-training has sufficiently equipped them to handle real-world audio. However, a critical question remains: how well do these models actually perform in real-world conditions, where audio is typically polyphonic and complex, involving multiple overlapping sound sources? Current audio self-supervised learning (SSL) methods are often benchmarked on datasets predominantly featuring monophonic audio, such as environmental sounds and speech. As a result, the ability of SSL models to generalize to polyphonic audio, a common characteristic in natural scenarios, remains underexplored. This limitation raises concerns about the practical robustness of SSL models in more realistic audio settings. To address this gap, we introduce Self-Supervised Learning from Audio Mixtures (SSLAM), a novel direction in audio SSL research, designed to improve the model's ability to learn from polyphonic data while maintaining strong performance on monophonic data. We thoroughly evaluate SSLAM on standard audio SSL benchmark datasets, which are predominantly monophonic, and conduct a comprehensive comparative analysis against state-of-the-art (SOTA) methods using a range of high-quality, publicly available polyphonic datasets. SSLAM not only improves model performance on polyphonic audio, but also maintains or exceeds performance on standard audio SSL benchmarks. Notably, it achieves up to a 3.9% improvement on AudioSet-2M (AS-2M), reaching a mean average precision (mAP) of 50.2. These results demonstrate SSLAM's effectiveness in both polyphonic and monophonic soundscapes, significantly enhancing the performance of audio SSL models. Code and pre-trained models are available at https://github.com/ta012/SSLAM.
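
SSLAM's actual mixing policy is defined in the paper and the released code; as a generic illustration of constructing a polyphonic training input, the NumPy sketch below mixes two mono clips at a random signal-to-noise ratio. All names and the SNR range are assumptions, not values from the paper.

import numpy as np

def mix_at_snr(source: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix two mono waveforms so `source` sits at the requested SNR over `noise`.

    Generic polyphonic-mixture sketch; SSLAM's mixing strategy is not
    reproduced here.
    """
    n = min(len(source), len(noise))
    source, noise = source[:n], noise[:n]
    source_power = np.mean(source ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so 10*log10(source_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(source_power / (noise_power * 10 ** (snr_db / 10)))
    mixture = source + scale * noise
    return mixture / (np.max(np.abs(mixture)) + 1e-12)  # peak-normalize

# Hypothetical usage with two random one-second "clips" at 16 kHz:
rng = np.random.default_rng(0)
clip_a, clip_b = rng.standard_normal(16000), rng.standard_normal(16000)
polyphonic = mix_at_snr(clip_a, clip_b, snr_db=rng.uniform(-5, 5))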